1 Hierarchical agglomerative clustering

Hierarchical agglomerative clustering is a “bottom-up” method of clustering. Each observation begins as its own cluster, and the most similar clusters are merged step by step as we move up the hierarchy. That is, all leaves start as their own clusters and fuse into larger clusters as we move up the trunk and the various branches form.

Distance and cluster method information are usually displayed at the bottom of the graph, while the vertical axis displays the height, which refers to the distance between two clusters. We are not concerned as much with distances along the horizontal axis. We can also “cut” the dendrogram to specify a number of clusters, which is similar to defining k in k-means clustering (and is equally problematic: there is no single correct choice).

In a real-life research situation, you will likely want to scale the data. However, raw data are used in this example.

# Package installation

if (FALSE) {
  # Run this line manually (once) to install the necessary packages.
  # Install packages from CRAN:
  install.packages(c("ape", "pvclust", "mclust"))
}

# fancy dendrogram options
library(ape)
# dendrograms with p-values
library(pvclust)
# model-based clustering
library(mclust)
## Package 'mclust' version 5.4.1
## Type 'citation("mclust")' for citing this R package in publications.

2 Load data

data(mtcars)
?mtcars

Start by using the hclust built-in function from {stats}. hclust requires a dissimilarity matrix, which we create with the dist function; as a result it clusters rows (observations), whereas the pvclust approach further below clusters columns (features).

3 The hclust built-in function

# See the help files
?hclust

# Create distance matrix
mtcars_dist = dist(mtcars, method = "euclidean")

# Fit hclust_model
system.time({
  hclust_model = hclust(mtcars_dist, method = "complete")
  })
##    user  system elapsed 
##   0.000   0.000   0.001
# Plot hclust_model dendrogram
plot(hclust_model, hang = -1)

Data are visualized in dendrograms, or branching tree-like structures similar to decision trees, albeit with less information displayed at each node. The most similar items are found lower in the dendrogram and fuse into \(n-1\) clusters as we move up the tree; the next two items to fuse into a cluster produce \(n-2\) clusters and so on as we move up the tree until there is just one overarching cluster. Thus, clusters become more inclusive as we move up the hierarchy.
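The fusion sequence just described is recorded on the fitted object itself. A minimal sketch (refitting the model from above so the block runs on its own):

```r
# Refit the model so this sketch is self-contained.
mtcars_dist = dist(mtcars, method = "euclidean")
hclust_model = hclust(mtcars_dist, method = "complete")

# $merge has one row per fusion: negative entries are original observations
# (leaves); positive entries refer to earlier fusions (rows of $merge).
head(hclust_model$merge)

# $height is the dissimilarity at which each fusion occurred, i.e. the
# vertical axis of the dendrogram. With n = 32 cars there are n - 1 = 31 fusions.
length(hclust_model$height)
```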

Dissimilarity is applied not just to pairs of single observations, but to groups of observations as well; this is what the linkage method defines. Thus the “Cadillac Fleetwood / Lincoln Continental” cluster fuses with “Chrysler Imperial” rather than with “Maserati Bora” or some other observation.
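To make the linkage idea concrete, here is a small sketch that computes the complete-linkage distance between those two groups by hand (the group names are taken from the dendrogram above):

```r
# Full pairwise Euclidean distance matrix on the raw data.
d = as.matrix(dist(mtcars, method = "euclidean"))

g1 = c("Cadillac Fleetwood", "Lincoln Continental")
g2 = "Chrysler Imperial"

# Complete linkage defines the intergroup distance as the largest
# pairwise dissimilarity between members of the two groups.
max(d[g1, g2])
```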

You can also cut the tree to see how the tree varies:

# If we want only 5 clusters, for example (must be a number between 1 and 32, since mtcars has only 32 observations):
cutree(hclust_model, 5) 
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive   Hornet Sportabout             Valiant          Duster 360 
##                   1                   1                   1                   2                   3                   2                   3 
##           Merc 240D            Merc 230            Merc 280           Merc 280C          Merc 450SE          Merc 450SL         Merc 450SLC 
##                   1                   1                   1                   1                   2                   2                   2 
##  Cadillac Fleetwood Lincoln Continental   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla       Toyota Corona 
##                   3                   3                   3                   4                   4                   4                   1 
##    Dodge Challenger         AMC Javelin          Camaro Z28    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   2                   2                   3                   3                   4                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   3                   1                   5                   1
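A quick way to summarize any cut is to tabulate the cluster sizes; a sketch, refitting the model so it runs standalone:

```r
mtcars_dist = dist(mtcars, method = "euclidean")
hclust_model = hclust(mtcars_dist, method = "complete")

# How many cars land in each of the 5 clusters?
table(cutree(hclust_model, 5))
```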

4 The ape package

The ape package provides some great functionality for constructing and plotting clusters:

library(ape)
# various plots
plot(as.phylo(hclust_model))

plot(as.phylo(hclust_model), type = "cladogram")

plot(as.phylo(hclust_model), type = "unrooted")

# radial plot
colors = c("red", "orange", "blue", "green", "purple")
clus5 = cutree(hclust_model, 5)
plot(as.phylo(hclust_model), type = "fan", tip.color = colors[clus5], lwd = 2, cex = 1)

NOTE: the color settings for the radial plot apply to the other ape plots as well.

5 The pvclust package

The pvclust package offers a straightforward way to perform hierarchical agglomerative clustering of columns with two types of p-values at each split: approximately unbiased (AU) and bootstrap probability (BP).

library(pvclust)
# Cluster features

# Ward's method: merges the pair of clusters that minimizes the
# increase in within-cluster variance
system.time({
  pvclust_model_ward = pvclust(mtcars, 
                          method.hclust = "ward.D",
                          method.dist = "euclidean",
                          nboot = 1000, parallel = TRUE)
  })
## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
##    user  system elapsed 
##   0.030   0.014   4.312
plot(pvclust_model_ward)

# pvrect draws rectangles around clusters with AU p-values greater than alpha
pvrect(pvclust_model_ward, alpha = 0.95)
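If you want the highlighted clusters as data rather than drawn rectangles, pvclust also provides pvpick(). A sketch, refit with a smaller bootstrap (nboot = 100 is an arbitrary choice to keep the example fast):

```r
library(pvclust)

# Small refit so this sketch runs quickly on its own.
pv_small = pvclust(mtcars, method.hclust = "ward.D",
                   method.dist = "euclidean", nboot = 100)

# pvpick() returns the clusters (lists of column names) and the edges whose
# AU p-value exceeds alpha: the same clusters pvrect() draws boxes around.
pvpick(pv_small, alpha = 0.95)$clusters
```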

5.0.1 Compare different linkage methods

# Complete linkage: distance between clusters is the largest pairwise dissimilarity
system.time({
  pvclust_model_complete = pvclust(mtcars, 
                          method.hclust = "complete",
                          method.dist = "euclidean",
                          nboot = 1000, parallel = TRUE)
})
## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
##    user  system elapsed 
##   0.023   0.013   4.234
# Single linkage: distance between clusters is the smallest pairwise dissimilarity
system.time({
  pvclust_model_single = pvclust(mtcars, 
                          method.hclust = "single",
                          method.dist = "euclidean",
                          nboot = 1000, parallel = TRUE)
})
## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
##    user  system elapsed 
##   0.024   0.012   4.128
# Average linkage: distance between clusters is the mean pairwise dissimilarity
system.time({
  pvclust_model_average = pvclust(mtcars, 
                          method.hclust = "average",
                          method.dist = "euclidean",
                          nboot = 1000, parallel = TRUE)
})
## Creating a temporary cluster...done:
## socket cluster with 7 nodes on host 'localhost'
## Multiscale bootstrap... Done.
##    user  system elapsed 
##   0.024   0.014   4.118
# View summaries
pvclust_model_ward
## 
## Cluster method: ward.D
## Distance      : euclidean
## 
## Estimates on edges:
## 
##       au    bp se.au se.bp      v      c  pchi
## 1  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 2  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 3  0.791 0.837 0.026 0.004 -0.895 -0.086 0.221
## 4  0.992 0.998 0.009 0.001 -2.667 -0.257 0.705
## 5  0.953 0.999 0.168 0.001 -2.476 -0.797 1.000
## 6  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 7  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 8  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 9  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 10 1.000 1.000 0.000 0.000  0.000  0.000 0.000
pvclust_model_complete
## 
## Cluster method: complete
## Distance      : euclidean
## 
## Estimates on edges:
## 
##       au    bp se.au se.bp      v      c  pchi
## 1  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 2  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 3  0.595 0.604 0.030 0.005 -0.251 -0.012 0.545
## 4  1.000 1.000 0.001 0.000 -3.331 -0.016 0.738
## 5  0.932 0.836 0.012 0.004 -1.233  0.256 0.216
## 6  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 7  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 8  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 9  0.976 0.999 0.040 0.001 -2.476 -0.504 0.842
## 10 1.000 1.000 0.000 0.000  0.000  0.000 0.000
pvclust_model_single
## 
## Cluster method: single
## Distance      : euclidean
## 
## Estimates on edges:
## 
##       au    bp se.au se.bp      v      c  pchi
## 1  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 2  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 3  0.639 0.591 0.029 0.005 -0.293  0.064 0.558
## 4  0.993 1.000 0.024 0.000 -2.916 -0.466 0.560
## 5  0.852 0.926 0.026 0.003 -1.245 -0.199 0.334
## 6  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 7  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 8  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 9  0.969 0.973 0.011 0.002 -1.899 -0.033 0.158
## 10 1.000 1.000 0.000 0.000  0.000  0.000 0.000
pvclust_model_average
## 
## Cluster method: average
## Distance      : euclidean
## 
## Estimates on edges:
## 
##       au    bp se.au se.bp      v      c  pchi
## 1  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 2  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 3  0.534 0.491 0.031 0.005 -0.032  0.054 0.561
## 4  0.911 0.999 0.171 0.001 -2.227 -0.880 0.673
## 5  0.885 0.864 0.018 0.004 -1.151  0.052 0.397
## 6  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 7  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 8  1.000 1.000 0.000 0.000  0.000  0.000 0.000
## 9  0.994 0.997 0.005 0.001 -2.625 -0.086 0.888
## 10 1.000 1.000 0.000 0.000  0.000  0.000 0.000
# Plot Euclidean distance linkages
par(mfrow = c(2,2))
plot(pvclust_model_ward, main = "Ward", xlab = "", sub = "")
pvrect(pvclust_model_ward)
plot(pvclust_model_complete, main = "Complete", xlab = "", sub = "")
pvrect(pvclust_model_complete)
plot(pvclust_model_single, main = "Single", xlab = "", sub = "")
pvrect(pvclust_model_single)
plot(pvclust_model_average, main = "Average", xlab = "", sub = "")
pvrect(pvclust_model_average)

par(mfrow = c(1,1))

5.0.2 View standard error plots

par(mfrow=c(2,2))
seplot(pvclust_model_ward, main = "Ward")
seplot(pvclust_model_complete, main = "Complete")
seplot(pvclust_model_single, main = "Single")
seplot(pvclust_model_average, main = "Average")

par(mfrow=c(1,1))

6 Going further - the mclust package

The mclust package provides “Gaussian finite mixture models fitted via EM algorithm for model-based clustering, classification, and density estimation, including Bayesian regularization, dimension reduction for visualisation, and resampling-based inference.”

library(mclust)

# Fit model
mclust_model = Mclust(mtcars)

# View various plots
plot(mclust_model, what = "BIC") 

plot(mclust_model, what = "classification")

plot(mclust_model, what = "uncertainty")

plot(mclust_model, what = "density")

6.0.1 Return the best-performing model

summary(mclust_model)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEI (diagonal, equal shape) model with 8 components: 
## 
##  log.likelihood  n  df       BIC       ICL
##       -358.1832 32 113 -1107.995 -1107.995
## 
## Clustering table:
## 1 2 3 4 5 6 7 8 
## 2 3 5 7 8 2 3 2
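The fitted classification and per-observation uncertainty can be pulled straight off the model object; a sketch (refitting so it runs standalone):

```r
library(mclust)

# Refit the model so this sketch is self-contained.
mclust_model = Mclust(mtcars)

# Hard cluster assignments for each car, and the probability that each
# assignment is wrong (1 minus the largest posterior probability).
head(mclust_model$classification)
head(mclust_model$uncertainty)
```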

6.0.2 Cross-validated mclust

# sort mpg in decreasing order
mtcars = mtcars[order(-mtcars$mpg),]
mtcars 
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
# create a binary factor variable from mpg: "less than 20mpg" and "greater than 20mpg"
# (note: 20 mpg itself falls in the first bin, since breaks are right-closed)
mtcars$class = cut(mtcars$mpg, 
                   breaks = c(0, 20, 40),
                   labels = c("less than 20mpg", "greater than 20mpg"))
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb              class
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 greater than 20mpg
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 greater than 20mpg
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 greater than 20mpg
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 greater than 20mpg
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 greater than 20mpg
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2 greater than 20mpg
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 greater than 20mpg
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 greater than 20mpg
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 greater than 20mpg
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1 greater than 20mpg
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 greater than 20mpg
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 greater than 20mpg
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 greater than 20mpg
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 greater than 20mpg
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6    less than 20mpg
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4    less than 20mpg
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2    less than 20mpg
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    less than 20mpg
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1    less than 20mpg
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4    less than 20mpg
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3    less than 20mpg
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3    less than 20mpg
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4    less than 20mpg
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2    less than 20mpg
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3    less than 20mpg
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2    less than 20mpg
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8    less than 20mpg
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4    less than 20mpg
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4    less than 20mpg
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4    less than 20mpg
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4    less than 20mpg
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4    less than 20mpg
# define our predictors (X) and class labels (class)
X = mtcars[ , -12]
class = mtcars$class

# fit the model (EEE covariance structure, basically the same as linear discriminant analysis)
mclust_model2 = MclustDA(X, class = class, modelType = "EDDA", modelNames = "EEE")

# cross-validate!
set.seed(1)
cv_mclust = cvMclustDA(mclust_model2, nfold = 20)

# View cross-validation error and standard error of the cv error
cv_mclust[c("error", "se")]
## $error
## [1] 0.09375
## 
## $se
## [1] 0.04519385
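cvMclustDA() also returns the held-out class predictions, so you can cross-tabulate them against the true labels; a sketch that rebuilds the setup above so it runs standalone:

```r
library(mclust)

# Rebuild the binary outcome and predictors from above.
data(mtcars)
mtcars$class = cut(mtcars$mpg, breaks = c(0, 20, 40),
                   labels = c("less than 20mpg", "greater than 20mpg"))
X = mtcars[, -12]
class = mtcars$class

mda = MclustDA(X, class = class, modelType = "EDDA", modelNames = "EEE")
set.seed(1)
cv = cvMclustDA(mda, nfold = 20)

# Confusion matrix of cross-validated predictions; the off-diagonal
# counts are the errors summarized by cv$error.
table(truth = class, predicted = cv$classification)
```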

References and resources:
- Quick-R: Cluster Analysis
- James et al., An Introduction to Statistical Learning, pp. 390-401
- pvclust
- STHDA: Beautiful dendrogram visualizations
- Gaston Sanchez: Visualizing Dendrograms in R
- Analysis of Phylogenetics and Evolution
- A Quick Tour of mclust
- mclust vignette (from 2012, but more detailed)
- A very useful walkthrough by Christian Lopez
- MoEClust: Gaussian Parsimonious Clustering Models with Gating and Expert Network Covariates
- See the cluster R package to learn more about agnes, clara, daisy, diana, fanny, flower, mona, and pam cluster methods!